Overview
- Reference genomes and GRC.
- FASTA and FASTQ (Unaligned sequences).
- SAM/BAM (Aligned sequences).
- BED (Genomic Intervals).
- GFF/GTF (Gene annotation).
- Wiggle files, bedGraphs and bigWigs (Genomic scores).
- VCF and MAF (Genomic variations).
class: inverse, center, middle
Reference Genomes
Are we there yet?
- The human genome isn’t complete!
- In fact, most model organisms’ reference genomes are being regularly updated.
- A reference genome consists of a mixture of known chromosomes and unplaced contigs, together called a genome reference assembly.
- Major revisions to assemblies result in a change of coordinates.
- Requires conversion between revisions.
- The latest genome assembly for humans is GRCh38.
- Patches add information to the assembly without disrupting the chromosome coordinates, e.g. GRCh38.p3.
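The distinction between major revisions and patches can be checked from the assembly name alone. The following is a minimal sketch, assuming names follow the `<release>.p<N>` pattern seen in GRCh38.p3; the function names `parse_assembly` and `needs_conversion` are made up for illustration, and the comparison only makes sense for two assemblies of the same organism.

```python
import re

# Assumed naming pattern: a release name, optionally followed by ".p<patch number>".
ASSEMBLY_PATTERN = re.compile(r"^(?P<release>[A-Za-z]+\d+)(?:\.p(?P<patch>\d+))?$")


def parse_assembly(name):
    """Split an assembly name into (release, patch); patch is 0 when absent."""
    match = ASSEMBLY_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognised assembly name: {name!r}")
    patch = int(match.group("patch")) if match.group("patch") else 0
    return match.group("release"), patch


def needs_conversion(name_a, name_b):
    """Coordinates change between major releases, not between patches."""
    return parse_assembly(name_a)[0] != parse_assembly(name_b)[0]


print(parse_assembly("GRCh38.p3"))
print(needs_conversion("GRCh37.p13", "GRCh38.p3"))  # different major release
print(needs_conversion("GRCh38", "GRCh38.p3"))      # same release, patch only
```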
Genome Reference Consortium.
- The GRC is a collaboration of institutes which curate and maintain the reference genomes for 3 model organisms.
- Human - GRCh38.p3
- Mouse - GRCm38.p3
- Zebrafish - GRCz10
- Other model organisms are maintained separately.
- Drosophila - Berkeley Drosophila Genome Project, BDGP36
Why do we need to know about reference genomes?
- Allows for genes and genomic features to be evaluated in their linear genomic context.
- Gene A is close to Gene B
- Gene A and Gene B are within feature C.
- Can be used to align shallow, targeted high-throughput sequencing reads to a pre-built map of an organism’s genome.
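Once features share a coordinate system, questions like "is Gene A close to Gene B?" reduce to arithmetic on start and end positions. The sketch below illustrates this with invented coordinates; the `Feature` tuple and the functions `distance` and `is_within` are hypothetical helpers, and positions are assumed to be 1-based and inclusive.

```python
from collections import namedtuple

# A feature on a linear genome (assumed 1-based, inclusive coordinates).
Feature = namedtuple("Feature", ["name", "chrom", "start", "end"])


def distance(a, b):
    """Bases separating two features: 0 if they overlap or touch,
    None if they sit on different chromosomes."""
    if a.chrom != b.chrom:
        return None
    if a.end < b.start:
        return b.start - a.end - 1
    if b.end < a.start:
        return a.start - b.end - 1
    return 0


def is_within(inner, outer):
    """True when `inner` lies entirely inside `outer` on the same chromosome."""
    return (inner.chrom == outer.chrom
            and outer.start <= inner.start
            and inner.end <= outer.end)


# Invented coordinates, for illustration only.
gene_a = Feature("GeneA", "chr1", 1000, 2000)
gene_b = Feature("GeneB", "chr1", 2500, 3500)
feature_c = Feature("FeatureC", "chr1", 500, 4000)

print(distance(gene_a, gene_b))
print(is_within(gene_a, feature_c) and is_within(gene_b, feature_c))
```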
Aligning to a reference genome

A reference genome
- A reference genome is a collection of contigs.
- A contig is a stretch of DNA sequence encoded as A,G,C,T,N.
- Typically comes in FASTA format.
- The “>” line contains information on the contig.
- The lines following it contain the contig sequence.
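The FASTA layout described above (a ">" header line followed by sequence lines) is simple enough to read by hand. The following is a minimal sketch rather than a full parser; `read_fasta` is a hypothetical helper and the contig names and sequences are invented.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous contig, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Invented example data.
example = """>chr1 hypothetical contig
ACGTNNAC
GGTA
>chrUn_1 hypothetical unplaced contig
TTGACN
"""

for header, sequence in read_fasta(example.splitlines()):
    print(header, len(sequence))
```

For real work, an established library (for example Biopython's `SeqIO`) is likely a better choice than a hand-rolled reader.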
class: inverse, center, middle
Genomic Annotation.
Genomic Annotation
GFF
- Used for genome annotation.
- Stores position, feature (exon) and meta-feature (transcript/gene) information.
Genomic Annotation
- Chromosome
- Start of feature
- End of feature
- Strand
Genomic Annotation
- Source
- Feature type
- Score
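The positional columns listed on these slides can be pulled out of a tab-separated GFF line as follows. This is a sketch assuming the nine-column GFF layout (sequence name, source, feature type, start, end, score, strand, phase, attributes); the phase column comes from the GFF specification rather than from the slides, and `parse_gff_line` and the example line are invented for illustration.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    ["chrom", "source", "feature_type", "start", "end",
     "score", "strand", "phase", "attributes"],
)


def parse_gff_line(line):
    """Split one GFF line into its nine tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"Expected 9 columns, found {len(fields)}")
    chrom, source, feature_type, start, end, score, strand, phase, attributes = fields
    return GffRecord(
        chrom, source, feature_type,
        int(start), int(end),
        None if score == "." else float(score),  # "." means no score
        strand, phase, attributes,
    )


# Invented example line.
line = "chr1\texample\texon\t1000\t1200\t.\t+\t.\tID=exon01;Parent=transcript01"
record = parse_gff_line(line)
print(record.chrom, record.start, record.end, record.strand, record.feature_type)
```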
Genomic Annotation
- Column 9 contains key-value pairs (e.g. ID=exon01), separated by semicolons “;”.
- ID - Feature name.
- Parent - Meta-feature name.
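Column 9 can be turned into a dictionary by splitting on semicolons and then on "=". The sketch below assumes GFF3-style `key=value` pairs, in which the meta-feature key is spelled `Parent`; GTF files use a different attribute syntax and would need separate handling. `parse_attributes` is a hypothetical helper and the identifiers are invented.

```python
def parse_attributes(column9):
    """Turn 'ID=exon01;Parent=transcript01' into a dict of key-value pairs."""
    attributes = {}
    for pair in column9.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        attributes[key] = value
    return attributes


attrs = parse_attributes("ID=exon01;Parent=transcript01")
print(attrs["ID"], attrs["Parent"])
```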
Genomic Variants
- Variant Call Format (VCF)
- Mutation Annotation Format (MAF)
MAF Structure
class: inverse, center, middle
Genomic Files for Computing.
bigWig, bigBed and tabix
- Many programs and browsers deal better with compressed, indexed versions of genomic files.
- SAM -> BAM (.bam and index file of .bai)
- Wiggle and bedGraph -> bigWig (.bw/.bigWig)
- BED -> bigBed (.bb)
- BED, VCF and GFF -> bgzip-compressed and tabix-indexed (.gz and index file of .tbi)
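The conversions above are usually done with command-line tools. The sketch below only builds the command lines as argument lists and does not run them. It assumes samtools, bgzip and tabix (HTSlib) and the UCSC utilities `wigToBigWig`, `bedGraphToBigWig` and `bedToBigBed` are the tools in use; `conversion_commands` and the `chrom.sizes` file name are hypothetical. These tools generally expect position-sorted input, so check the documentation for your installed versions before running anything.

```python
import os


def conversion_commands(path, target, chrom_sizes="chrom.sizes"):
    """Return the commands (as argument lists) that would build `target` from `path`."""
    stem, ext = os.path.splitext(path)
    ext = ext.lower()
    if target == "bam" and ext == ".sam":
        # Sorting writes BAM by default; indexing then creates the .bai file.
        return [["samtools", "sort", "-o", stem + ".bam", path],
                ["samtools", "index", stem + ".bam"]]
    if target == "bigwig" and ext in (".wig", ".bedgraph"):
        tool = "wigToBigWig" if ext == ".wig" else "bedGraphToBigWig"
        return [[tool, path, chrom_sizes, stem + ".bw"]]
    if target == "bigbed" and ext == ".bed":
        return [["bedToBigBed", path, chrom_sizes, stem + ".bb"]]
    if target == "tabix" and ext in (".bed", ".vcf", ".gff"):
        # bgzip produces <path>.gz; tabix then writes <path>.gz.tbi.
        return [["bgzip", path],
                ["tabix", "-p", ext[1:], path + ".gz"]]
    raise ValueError(f"No known conversion from {ext or path!r} to {target!r}")


for command in conversion_commands("variants.vcf", "tabix"):
    print(" ".join(command))
```

Each returned list could be passed to `subprocess.run` once the tools are installed and the inputs are sorted.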